programming perspectives

Structuring Docker files to improve build times

I was browsing through some of my old ADRs recently and was reminded of some research I'd done into building Docker images. In this case I was trying to create a container image for building an older application, which required third-party dependencies and lots of tweaking. This meant that I was spending a lot of time editing the Docker file and even more time waiting for the image to build, which quickly became very frustrating. In my case I managed to reduce build times significantly by changing how I structured my Docker file. This article is a discussion of my thinking.

This post isn't intended to be an introduction to this technology - there's a great introduction to Docker on their site if you need one.

What's the problem?

When you issue a docker build command, Docker follows the instructions in your Docker file and builds an image. Let's think about the following set of instructions:

ADD ".\foo.exe" .
RUN start /wait .\foo.exe
RUN del foo.exe
ADD ".\bar.ps1" .

Firstly, we add the file foo.exe to the image, then we run it, then we delete it. Finally, we add the bar.ps1 file. This should all build reasonably quickly. Now let's imagine that foo.exe is 500 MB and that running the install takes 15 minutes. Without caching, each time you tweak the contents of bar.ps1 and rebuild you're going to have to wait at least 15 minutes for the build to complete. Enter caching...

Caching

Obviously, we don't want to be waiting around all that time for a simple script change, so Docker uses build-time caching for ADD, COPY, and RUN commands. Let's walk through how this works using the previous example:

Add foo.exe to the image and add the result to the cache

ADD ".\foo.exe" .

Run foo.exe and add the result to the cache

RUN start /wait .\foo.exe

Delete foo.exe and add the result to the cache

RUN del foo.exe

Add bar.ps1 to the image and add the result to the cache

ADD ".\bar.ps1" .

So now what happens if we change bar.ps1 and rebuild? Docker will recognise that foo.exe and the first three instructions haven't changed, so it reuses the first three layers and only needs to rebuild the final one.
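
To make the walkthrough above a bit more concrete, here is a toy model in Python. It is only a sketch of the idea, not how Docker actually computes its cache: I'm assuming each layer gets a key derived from the previous layer's key, the instruction text, and (for ADD) the file contents. The layer_keys function and the file contents are made up for illustration.

```python
import hashlib


def layer_keys(steps, files):
    """Return one cache key per step; each key depends on the previous key."""
    keys = []
    parent = ""
    for instruction, source in steps:
        # ADD steps also hash the file content; RUN steps hash only the instruction text.
        content = files.get(source, b"") if source else b""
        digest = hashlib.sha256(parent.encode() + instruction.encode() + content).hexdigest()
        keys.append(digest)
        parent = digest
    return keys


# The four instructions from the example, paired with the file each ADD copies in.
steps = [
    ('ADD ".\\foo.exe" .', "foo.exe"),
    ("RUN start /wait .\\foo.exe", None),
    ("RUN del foo.exe", None),
    ('ADD ".\\bar.ps1" .', "bar.ps1"),
]

# Hypothetical file contents: only bar.ps1 differs between the two builds.
before = layer_keys(steps, {"foo.exe": b"installer bytes", "bar.ps1": b"Write-Host 'v1'"})
after = layer_keys(steps, {"foo.exe": b"installer bytes", "bar.ps1": b"Write-Host 'v2'"})

reused = [old == new for old, new in zip(before, after)]
print(reused)  # [True, True, True, False]
```

The first three keys match between the two builds, so those layers can come from the cache, and only the final ADD has to be redone.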

Layering considerations

Overall, this caching strategy does help to massively improve build performance - so we're done, right? Not quite. I mentioned the word cache earlier and, as we all know, caches get invalidated. In the case of Docker, a cached layer is invalidated when its own instruction or inputs change, or when any layer before it changes. Let's consider this alternative layout of my original Docker file, where I've moved the ADD ".\bar.ps1" . command to the top:

ADD ".\bar.ps1" .
ADD ".\foo.exe" .
RUN start /wait .\foo.exe
RUN del foo.exe

Each command still creates a layer in the cache, as we discussed earlier, but now if I change bar.ps1 every layer that comes after it is invalidated and has to be rebuilt, including the 15 minute install of foo.exe. Given this, we still need to consider how we structure our Docker files.
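
A rough bit of arithmetic shows how much the ordering matters. In the sketch below only the 15 minute (900 second) install comes from my example; the other timings are invented purely for illustration, and rebuild_seconds is a hypothetical helper that adds up the cost of the first invalidated step and everything after it.

```python
def rebuild_seconds(step_seconds, first_changed):
    """Sum the cost of the first invalidated step and every step after it."""
    return sum(step_seconds[first_changed:])


# Seconds per step. Only the 900 second install is from the article; the rest are assumed.
original = [60, 900, 5, 5]   # ADD foo.exe, RUN foo.exe, RUN del foo.exe, ADD bar.ps1
reordered = [5, 60, 900, 5]  # ADD bar.ps1, ADD foo.exe, RUN foo.exe, RUN del foo.exe

# In the original layout bar.ps1 is the last step, so only it is rebuilt.
print(rebuild_seconds(original, 3))   # 5

# In the reordered layout bar.ps1 is the first step, so everything is rebuilt.
print(rebuild_seconds(reordered, 0))  # 970
```

Same instructions, same files, but with these assumed timings the rebuild goes from a few seconds to over 16 minutes just because of where the frequently changed file sits.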

Let's consider the following file layout:

# layer 1
ADD ".\foo.ps1" .

# layer 2
ADD ".\bar.exe" .

# layer 3
RUN start /wait powershell .\foo.ps1

# layer 4
RUN start /wait bar.exe /q

# layer 5
RUN del foo.ps1

# layer 6
RUN del bar.exe

As a long-time software engineer this was my initial approach - grouping operations together by type: adding files, running the commands, deleting the files. As we've just learnt, each of these operations will generate a new layer in the cache, and any change to bar.exe, for example, will invalidate all of the layers that follow it, including the one that runs foo.ps1. That pushes us back to longer build times.

So now let's think about an alternative layout, where we group around the logical layering of the image: copy, install, and delete foo; then copy, install, and delete bar. Changing bar.exe no longer invalidates any of the foo layers, because they all come before it, so the cached versions can be reused at build time.

# grouping 1
ADD ".\foo.ps1" .
RUN start /wait powershell .\foo.ps1
RUN del foo.ps1

# grouping 2
ADD ".\bar.exe" .
RUN start /wait bar.exe /q
RUN del bar.exe
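
As a sanity check on the two layouts, here is a small Python sketch that lists which steps get rebuilt when a given file changes. It assumes the simple rule described earlier (the step that adds the changed file, plus every step after it, is rebuilt) and is not a model of Docker's real behaviour. The rebuilt_steps helper and the step labels are just illustrative names.

```python
def rebuilt_steps(layout, changed_file):
    """Return labels of steps that rerun once changed_file is modified."""
    for index, (label, added_file) in enumerate(layout):
        if added_file == changed_file:
            # This step and everything after it loses its cached layer.
            return [name for name, _ in layout[index:]]
    return []


# Each step is (label, file it adds to the image, or None for RUN steps).
by_type = [
    ("ADD foo.ps1", "foo.ps1"), ("ADD bar.exe", "bar.exe"),
    ("RUN foo.ps1", None), ("RUN bar.exe", None),
    ("del foo.ps1", None), ("del bar.exe", None),
]
by_component = [
    ("ADD foo.ps1", "foo.ps1"), ("RUN foo.ps1", None), ("del foo.ps1", None),
    ("ADD bar.exe", "bar.exe"), ("RUN bar.exe", None), ("del bar.exe", None),
]

print(rebuilt_steps(by_type, "bar.exe"))
# ['ADD bar.exe', 'RUN foo.ps1', 'RUN bar.exe', 'del foo.ps1', 'del bar.exe']
print(rebuilt_steps(by_component, "bar.exe"))
# ['ADD bar.exe', 'RUN bar.exe', 'del bar.exe']
```

Grouping by type drags the foo.ps1 steps into the rebuild even though foo.ps1 never changed, whereas grouping by component leaves them cached.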

You can also optimise a little further, as I did, to reduce the overall number of cache layers by combining each run and delete into a single RUN command. In this case I ended up with just four:

# escape=`

# layer 1
ADD ".\foo.ps1" .

# layer 2
RUN start /wait powershell .\foo.ps1 `
&& del foo.ps1

# layer 3
ADD ".\bar.exe" .

# layer 4
RUN start /wait bar.exe /q `
&& del bar.exe

In summary

Obviously, this isn't a silver bullet; pretty much everything in software engineering is a trade-off. The key takeaway here should be that build caching exists and you can use it to your advantage. You just need to think about your Docker file, and the nature of the image you are building, to leverage it to its fullest. If you want to do some in-depth reading there's some great documentation on the Docker site that goes into more detail.

What next?

I don't have a comment section on my blog at the moment, but I'm always happy to chat on Mastodon.